Genomic epidemiology and emergence of SARS-CoV-2 variants of concern in the United Arab Emirates

Since the declaration of SARS-CoV-2 outbreak as a pandemic, the United Arab Emirates (UAE) public health authorities have adopted strict measures to reduce transmission as early as March 2020. As a result of these measures, flight suspension, nationwide RT-PCR and surveillance of viral sequences were extensively implemented. This study aims to characterize the epidemiology, transmission pattern, and emergence of variants of concerns (VOCs) and variants of interests (VOIs) of SARS-CoV-2 in the UAE, followed by the investigation of mutations associated with hospitalized cases. A total of 1274 samples were collected and sequenced from all seven emirates between the period of 25 April 2020 to 15 February 2021. Phylogenetic analysis demonstrated multiple introductions of SARS-CoV-2 into the UAE in the early pandemic, followed by a local spread of root clades (A, B, B.1 and B.1.1). As the international flight resumed, the frequencies of VOCs surged indicating the January peak of positive cases. We observed that the hospitalized cases were significantly associated with the presence of B.1.1.7 (p < 0.001), B.1.351 (p < 0.001) and A.23.1 (p = 0.009). Deceased cases are more likely to occur in the presence of B.1.351 (p < 0.001) and A.23.1 (p = 0.022). Logistic and ridge regression showed that 51 mutations are significantly associated with hospitalized cases with the highest proportion originated from S and ORF1a genes (31% and 29% respectively). Our study provides an epidemiological insight of the emergence of VOCs and VOIs following the borders reopening and worldwide travels. It provides reassurance that hospitalization is markedly more associated with the presence of VOCs. This study can contribute to understand the global transmission of SARS-CoV-2 variants.


Methodology
Ethics statement. This study has been approved by the local ethics committee at Abu Dhabi Health COVID-19 Research Ethics Committee (DOH/DQD/2020/538), SEHA Research Ethics Committee (SEHA-IRB-005) and Ministry of Health and Prevention (MOHAP/DXB-REC/ AAA/No. 80/2021). This study was conducted in accordance with international ethical standards (Declaration of Helsinki 1964) and UAE federal law No. (4) of 2016. Participant information was coded and held securely in compliance with the Data Protection Regulation of Khalifa University. Informed Consent was obtained from a family member of patients who were on ventilators with a signed agreement by a supervising physician. All data were de-identified prior to use. Study population and data collection period. This cross-sectional study recruited a total of 1,538 participants that have been tested positive for SARS-CoV-2 by quantitative real-time Polymerase Chain Reaction (qPCR) if the cycle threshold (Ct) value was 36 or less via nasopharyngeal swabs. Samples were collected between 25 April 2020 to 15 February 2021 from multiple sites across the seven emirates in the UAE (Abu Dhabi, Dubai, Sharjah, Ajman, Umm Al Quwain, Ras Al Khaimah, and Fujairah) from multiple medical centers, hospitals, quarantine camps and non-quarantine facilities ( Supplementary Fig. 1). Extracted RNA from SARS-CoV-2 samples was amplified by WHO-recommended primers and probes targeting the ORF, N and S genes. Demographic and clinical data for SARS-CoV-2 sequenced samples in UAE are shown in Table 1. Due to the heterogeneous nature of COVID-19's phenotype spectrum, a broad definition was utilized to categorize the severity status of the affected cases into home quarantine, hospitalized and deceased. . Primers used to generate amplicons from Viral RNA are removed during the tagmentation step of the library preparation protocol. During tagmentation, Amplicons are fragmented and tagged with adapters and bound on to the tagmentation beads. Primers, buffers and other reagents from amplification step are removed during the washes after tagmentation before proceeding with Indexing PCR. SARS-CoV-2 genome assembly and multiple sequence alignment. In-house CovSeq pipeline was used in this study, following the best practices and instructions recommended by the Broad Institute's Genome Analysis ToolKit (GATK) 23 . All CovSeq reads were checked for quality using FastQC software version 0.11.5 24 . Low quality reads, primers and Illumina adapters were removed using Trimmomatric tool version 0.33.0 25 . Trimmed reads were mapped to SARS-CoV-2 reference genome (Wuhan-Hu-1-NC_045512.2/MN908947.3 using Burrows-wheeler Aligner (BWA) v.0.7.12 (BWA-MEM) 26  Context selection and phylogenetic tree generation. A total of 1,274 SARS-CoV-2 sequences were quality filtered (see below) and used as seed for context selection: a context database suitable for BLAST queries was created using 399,124 SARS-Cov-2 sequences in GISAID as per February 16, 2021. All local sequences were compared to all GISAID sequences using Nucleotide-Nucleotide BLAST 2.6.0 (blastn) 31 , retaining up to 30 matches per query sequence, with maximum 20 mismatches. Further, a quota of maximal 100 sequences per country is introduced to counter-balance the heterogeneity in national sequencing efforts. The rationale behind this approach is to construct a phylogenetic tree that includes all sequences most relevant to the local samples.
After context selection, 3267 sequences were used to construct the phylogenetic tree using Augur 30 . SARS-CoV-2 Fasta and metadata files were filtered, and aligned to the reference sequence (NC_045512.2/ MN908947.3) using MAFFT v.2 32 , whereas any sequence sites with potential errors were masked 33 . The phylogeny tree was constructed using augur commands tree and refine, which in turn deploy IQ-TREE v1.6.8 and TreeTime 34 . Other augur subcommands were utilized to reconstruct mutations, label clades, and infer geographic movement which can be visualized on Auspice. The entire workflow is managed with snakemake 35 . Detection of multiple introductions of VOCs. We repeat the above steps for the subclades of VOCs B.1.1.7 (alpha) and B.1.351 (beta) by selecting all UAE based variants as queries (377 and 39, respectively) contextualized with similar BLAST hits in GISAID sequences according to the pangolin lineage, yielding variant specific contexts with a total of 597 and 237 sequences, respectively. Augur (using TreeTime) also estimates the www.nature.com/scientificreports/ origin of ancestral nodes, with the caveat that this estimate is skewed towards country-based sequencing efforts. The phylogenies from those extended contexts facilitate the identification of likely local transmission events (and by contrast) international introductions and demonstrate the genetic diversity.
Phylodynamic analysis. We deploy TreeTime, to plot the history of effective population size, also known as skyline. TreeTime maximizes the coalescence likelihood from a phylogenetic tree similar to other state of the art tools like BEAST 36 , but compares favorably with regard to computational efficiency. We therefore could calculate the effective population size based on all sequences sampled in the UAE, without the need for downsampling. The exact parameter settings are provided in the "Supplementary material".
Filtration and samples inclusion. Samples with incomplete demographic data (age, gender, nationality, and patient status) were removed from the analysis. In the combined metadata file comprising local and international samples, the records with missing information such as date of collection were filtered out and not included in the phylogenetic tree analysis. Nextclade 0.14.1 default quality control was adopted in this study for mutation call 30 . Samples that did not pass Nextclade quality check were excluded. The quality control used by Nextclade includes the number missing and ambiguous nucleotides, degree of divergence, and clustered differences. The total number of samples that failed Nextclade quality check were 47. In addition, samples with poor genomic coverage (i.e., coverage < 10×, n = 219) were excluded from this study. In total, two hundred and sixtyfour samples were excluded from this study. The final study population that passed filtration and included in the analysis were 1274.
Severity trait locus mapping. Single-nucleotide mutations of each variant were extracted from the aligned sequences via SNP-sites into a VCF file before converting into PLINK formatted files 37 . The PLINK file was augmented with age and sex information as covariates to model the severity as a binary trait. We deployed regenie (https:// rgcgi thub. github. io/ regen ie/) to conduct a whole genome regression on the severity trait 38 . The options set were for modeling binary traits, with a genotype block size of 100, and approximate Firth likelihood ratio test for p < 0.01. Further validation was conducted through PLINK's Assoc command which yields whole genome association analyses with adjusted p-values.
Statistical analysis. The descriptive variables were verified using frequency analysis. Pearson Chi-square test was used to study categorical variables via cross-tabulation. Multivariate logistic regression models tested multivariate relationships between symptom severity and the presence of the variant of concern. Multivariate logistic regression models and whole genome regression with the use of regenie tool was used to test multivariate relationships between symptom severity and the presence of mutations. All regression models accounted for age (continuous) and gender (bivariate: male/female). The significance level adopted for all analyses was p < 0.05. For the mutational analyses, we controlled for multiple testing using the Bonferroni correction for 77 comparisons (number of mutations) to an alpha level of 0.05, resulting in the corrected threshold of 0.05/77 = 6.49 × 10 −4 . All statistical analyses were performed with Statistical Package for Social Science (SPSS) version 20 and R (Version 3.4.1).

Results
Patient characteristics. A total of 1274 patients, whose geographical, demographic, and clinical characteristic are shown in Table 1, further stratified by VOC/VOI identification. Participants were recruited from seven emirates across the UAE: 95% from Abu Dhabi, 5% from Sharjah, 0.3% from Ajman, 0.4% Umm Al-Quwain, 0.2% from Ras Al-Khaimah, and 0.2% from Fujairah. Of these cases, 59% were males and 41% were female, with the highest proportion of cases in the > 28 age, and from Middle East (49%) and Asia (41%) group. The majority of patients were home quarantine (81%), whereas 17% were hospitalized and 2% were deceased. Of   Supplementary Fig. 3 demonstrates the daily new confirmed COVID-19 cases (Supplementary Fig. 3A) and daily new deaths ( Supplementary Fig. 3B), collected from the official National Crisis and Emergency Management Authority (NCEMA) in the UAE, alongside time in which the mitigation measures were put in place (Supplementary Fig. 3C). As reflected, the number of COVID-19 cases and deaths decreased after strong mitigation measures implemented by the government in March 2020 ( Supplementary Fig. 3C). However, shortly after the borders were opened in July 2020, cases started to slowly surge. Supplementary Fig. 3A demonstrates the COVID-19 confirmed cases in the UAE, stratified by the estimated frequency data of VOC vs. Non-VOC. The estimated frequency data of VOC vs. Non-VOC was extrapolated from the sequencing data of this study throughout the time and applied to the NCEMA figures as an estimation analysis. The general wave structure was corroborated through estimation of effective population size including, as demonstrated in Supplementary Fig. 4.   Fig. 3).

Patient status and VOC/VOI.
The B.1.1.7 variant was identified in 369 cases, with approximately 50 introductions and multiple local transmissions across the UAE, suggesting a widespread local transmission and diversification. The B.1.351 variant was identified in 37 cases, with approximately 15 introductions and one mass spread event infecting 9 cases simultaneously, this spread is highlighted in pink (Fig. 2). However, it should be noted that the quantification the amount of VOC introductions is biased by strongly differing sampling efforts per country and reporting to GISAID. We do however identify a broad phylogenetic diversity, which is highly unlikely to be caused only by local transmissions and based on Augur's origin estimation for ancestral nodes-the result of multiple international introductions. For some sub-clades (see Fig. 2 Emergence of VOCs in the UAE. In our dataset, we identified 2777 different mutations affecting the protein amino acid sequence in the patient sample. The average number of mutations presented in each category were as follows: 14.8 mutation in non-hospitalized and 21.6 in the hospitalized group. Only mutations that are present in 5% of the samples were selected for mutation analysis. The number of mutations based on the above criteria for non-hospitalized, and hospitalized were 35 and 77, respectively. The analysis across all 77 mutations Table 2. Association of SARS-CoV-2 VOC/VOI infections to clinical severity status. Chi-squared test of significance was used to measure associations between reference category (Home Quarantine) and each category in the model. Multivariate analysis (Home Quarantine vs Hospitalized; Home Quarantine vs Deceased) was used for the regression models, presented as unadjusted OR and adjusted OR for age and gender. NA: a regression analysis was not conducted due to the lack of hospitalized/deceased participants in the case group. CI confidence interval, NA not applicable, OR odds ratio. To assess the mutations that are related to hospitalized cases, logistic and ridge regression analysis was conducted on mutations that showed significant association with severity (n = 37). The infection of the mutations adjusted for age and gender were more likely to be associated with hospitalized cases than non-hospitalized (Supplementary Table 2), after Bonferroni correction at p > 6.49 × 10 −4 . The highest proportion of mutations were originated from S and ORF1a genes (35% and 29% respectively). Additional mutations associated with the hospitalized cases of COVID-19 are outlined in Supplementary Table 2. The structural and accessory proteins of SARS-CoV-2 that are significantly associated with hospitalized COVID-19 cases after adjustment for age and gender, and Bonferroni correction at p > 6.49 × 10 −4 , is summarized in Table 3. The complete list of mutations correlated to hospitalized status is presented in Supplementary Table 2. A Manhattan plot (Fig. 3) and the output of regenie's GWAS on the corresponding SNPs (Supplementary Table 3) was generated from the ridge regression analysis were regenie tool was deployed to conduct whole genome regression on the severity trait.

Discussion
For the first time, this study demonstrates the entry of the new SARS-CoV-2 variants of concern and interests, and the outbreak dynamics in the UAE. Global massive ongoing transmission and the continuous evolution of new strains demonstrates that strict mitigation measures are important to effectively control the spread of the virus. To do so, a better understanding of the phylogenomic spread and transmission dynamics could contribute to more targeted and effective responses to the pandemic.
The analysis of 1274 viral genomes collected in the UAE, indicates the presence of 11 major clades. The occurrence of the root clades A and East Asian B was clearly seen in the early months of 2020 suggesting early spatiotemporal introduction into the UAE. Distribution of B.1 and B.1.1, which are descendants containing the spike mutation D614G, began in early May 2020 despite the vigilant health measures, which could suggest the multiple independent entry from Europe, Asia, and Middle East prior to the national lockdown. As the nation-wide public health measures were implemented, B.1.1 distributed locally until late July 2020. Despite lockdown and strict measures, we have observed a substantial local transmission within Abu Dhabi and Dubai, in addition to a low  Table 3. Brief description of various structural and accessory proteins of SARS-CoV-2 that are significantly associated with hospitalized COVID-19 cases after adjustment for age and gender, and Bonferroni correction at p > 6.49 × 10 −4 .

Protein name Coding region Mutations Role
Spike protein (n = 13)  S  A243, A570D, D1118H, D215G, F157L, H69-V70, N501Y,  P681H, Q613H, S982A, T716I, V367F, Y144 Binds to ACE2 host cell receptor and mediates viral entry within the host cell 15,39 Nucleocapsid protein (n = 6) N D3L, M1X, R203K, S194L, S235F, S2Y Roles in Encapsulates viral nucleic acid 40 We have performed mutation analysis to define any significant correlation between patient severity and mutations resulting in amino acids sequence changes. A total of 37 structural and accessory proteins of SARS-CoV-2 are significantly associated with hospitalized COVID-19 cases after adjustment for age and gender, and Bonferroni correction. Overall, we have observed more mutation in the structural spike protein (n = 13). We identified four major mutations of concerns in spike region that are associated with hospitalized cases in our study. N501Y that presents in B.1.1.7 and B.1.351 lineages has been reported to increase ACE2-binding affinity 47 and as a mean of immune escape 39 . Other mutations such as A570D, D1118H, P681H, S982A, T716I, two deletions H69-V70 and Y144 in spike protein, in addition to D3L, S194L and S235F in nucleocapsid protein were found in B.1.1.7 lineage are in accordance with the studies indicating the high risk of hospital admission and severe disease in B.1.1.7 patients compared to wild-type variant 45,48 . B.1.351 lineages mutations found in this study such as A701V were reported by Campbell et al. 49 to increase transmissibility by 25% and death in the hospitalized patients by 20%. Other mutations reported in spike and nucleocapsid regions (S: A243; N: M1X and S2Y) in this study have been associated with hospitalized cases, yet no studies have shown any association between these mutations and severity. The importance of ORF1a and ORF1b have been reported in viral replication, transcription, morphogenesis, and evasion of the host of the immune response. Concordant to our results, A1708D, I2230T, and T1001I mutations in ORF1a found in alpha lineages have been associated with hospital admission 45 . The remaining mutations in ORF1ab (L730F, M372I, T350N, A1708D, I2230T, T1001I, K3353, F3677, G3676, L3667F, S3675, T239I, K1383R) have not been reported to correlate to the severity in other studies.
Other significant correlations were reported between hospitalized outcome and accessory proteins such as ORF8 and ORF9b. Although accessory proteins are not involved in virus replication, accumulating evidence demonstrated their critical roles in viral pathogenesis. Most mutations in accessory proteins were at ORF8 which were not identified in other studies. ORF8 was found to induce major histocompatibility class I (MHC1) downregulation, thus providing protection against cytotoxic T cells (CTLs) 50 . In addition, ORF8 expressing cell and SARS-CoV-2 infected cells are resistant to CLT lysis, which was restored with knockdown of ORF8 expression 50,51 . It is suggested that SARS-CoV-2 could potentially benefit from missense mutations in ORF8 protein to evade immune surveillance 51 . We also identified K68, Q27, R52I, and Y73C mutation in ORF8, and R32P mutation in ORF9b in hospitalized patient. Mutations in ORF9b has been reported to interact with the mitochondria outer membrane protein (TOM70), thus suppresses interferon response 43 .
Limitation of the study should be addressed. At the beginning of pandemic, most patients (asymptomatic and symptomatic) were admitted to the hospital or quarantine areas which could not necessarily reflect the severity of the patient. Therefore, due to the complex nature of the COVID-19 phenotype presentation, statistical and methodological heterogeneity may be present. Also, the admission of patients may be influenced by other factors such as immediate status, comorbidities, and age. Second, the classification of ethnicity might be impression due to using nationality recorded from official passport as a surrogate for ethnicity. Epidemiological features such as travel-related, comorbidities, treatments and severe admission were limited in this study, which impacted post-hoc adjustment analysis. It is clearly noted that 95% of the patients were from the Emirate of Abu Dhabi which indicates the necessity of including further samples from other Emirates. Our mutation analysis may have sampling bias, since only 17% of patients were hospitalized, whereas the remaining were non-hospitalized.
Our study provides an epidemiological insight into the emergence of VOCs and VOIs following borders reopen and worldwide travels. It provides reassurance that hospitalization is markedly more associated with the presence of VOCs. The major strength of this study was the comprehensive longitudinal analysis which covered the early months of COVID-19 in UAE, until the peak of the 3rd wave in February 2021. However, the collection of good quality data such as vaccine status, severity, and travel history in combination with rapid genome sequence are imperative in understanding the behavior and role of variants related to clinical outcomes. This study can contribute to understanding the global transmission of SARS-CoV-2 variants.